Significance of Bridging Real-world Documents and NLP Technologies

نویسندگان

  • Tadayoshi Hara
  • Goran Topić
  • Yusuke Miyao
  • Akiko Aizawa
چکیده

Most conventional natural language processing (NLP) tools assume plain text as their input, whereas real-world documents display text more expressively, using a variety of layouts, sentence structures, and inline objects, among others. When NLP tools are applied to such text, users must first convert the text into the input/output formats of the tools. Moreover, this awkwardly obtained input typically does not allow the expected maximum performance of the NLP tools to be achieved. This work attempts to raise awareness of this issue using XML documents, where textual composition beyond plain text is given by tags. We propose a general framework for data conversion between XML-tagged text and plain text used as input/output for NLP tools and show that text sequences obtained by our framework can be much more thoroughly and efficiently processed by parsers than naively tag-removed text. These results highlight the significance of bridging real-world documents and NLP technologies.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Cross border E-Science and Research Partnership: Bridging the Gap Between Science and Media

  E-Science is a tool that helps scientists to store, interpret, analyze and make a network of their data, and it can play a critical role in different aspects of the scientific goals and research. This commentary, under the topic of Cross Border E-Science and Research Partnership: Bridging the Gap between Science and Media,[1] attempts to shed light on E-Science with emphasis on three importa...

متن کامل

D(H)ante: A New Set of Tools for XIII Century Italian

In this paper we describe 1) the process of converting a corpus of Dante Alighieri from a TEI XML format in to a pseudo-CoNLL format; 2) how a pos-tagger trained on modern Italian performs on Dante’s Italian 3) the performances of two different pos-taggers trained on the given corpus. We are making our conversion scripts and models available to the community. The two other models trained on the...

متن کامل

A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents

The recent growth of the World Wide Web at increasing rate and speed and the number of online available resources populating Internet represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely represented as unstructured natural language formats. In order ...

متن کامل

Natural Language Processing Technologies for Document Profiling

Nowadays, search for documents on the Internet is becoming increasingly difficult. The reason is the amount of content published by users (articles, comments, blogs, reviews). How to facilitate that the users can find their required documents? What would be necessary to provide useful document meta-data for supporting search engines? In this article, we present a study of some Natural Language ...

متن کامل

Characterizing the Validity and Real-World Utility of Health Technology Assessments in Healthcare: Future Directions; Comment on “Problems and Promises of Health Technologies: The Role of Early Health Economic Modelling”

With their article, Grutters et al raise an important question: What do successful health technology assessments (HTAs) look like, and what is their real-world utility in decision-making? While many HTAs are published in peer-reviewed journals, many are considered proprietary and their attributes remain confidential, limiting researchers’ ability to answer these questio...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014